InClass_Ex05: Build Logistic Regression to identify functional & non-functional water points in Osun state of Nigeria

pacman::p_load(sf, tidyverse, funModeling, blorr, corrplot, ggpubr, spdep, GWmodel,
               tmap, skimr, caret)
1. Objective.
In this exercise, we aim to build a logistic regression model to identify ‘Functional’ & ‘Non-Functional’ water-points in Osun state of Nigeria.
1.1 Input Data Used.
The input data used for this modeling are:
- Osun.rds: contains the LGA (Local Government Area) boundaries of Osun state. It is an sf polygon data frame.
- Osun_wp_sf.rds: contains the water point data as an sf point data frame.
1.2 Quick Notes on Logistic Regression.
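As a brief reminder, in standard textbook form (the symbols below are chosen for illustration and do not come from the exercise data), logistic regression models the log-odds of the outcome as a linear function of the predictors:

```latex
\log\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k,
\qquad
p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}}
```

Here $p$ is the probability that status is TRUE, $x_1, \dots, x_k$ are the predictors, and $\beta_0, \dots, \beta_k$ are the coefficients estimated by maximum likelihood.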

2. Load required packages.
This exercise uses the packages listed in the table below.
| # | Package | Function |
|---|---|---|
| 1 | sf | A package that provides simple features access for R. Mainly used for importing, managing, and processing geospatial data. |
| 2 | tidyverse | For performing data science tasks such as importing, wrangling and visualizing data. |
| 3 | funModeling | This package contains a set of functions related to exploratory data analysis, data preparation, and model performance. |
| 4 | blorr | Tool for building & validating binary logistic regression models. |
| 5 | corrplot | For creating graphical display of a correlation matrix. |
| 6 | ggpubr | For data visualization. |
| 7 | spdep | Spatial Dependence - A collection of functions to create spatial weights matrix objects from polygon contiguities. |
| 8 | skimr | Exploratory Data Analysis. |
| 9 | tmap | For choropleth map creation. |
| 10 | caret | For building machine learning models. |
| 11 | GWmodel | Geographically weighted (GW) models, a branch of spatial statistics in which model coefficients are allowed to vary over space. |
The `pacman::p_load()` call at the top of this document loads the required packages.
3. Read Input Files.
Osun_sf <- read_rds("rds/Osun_wp_sf.rds")  # water points (forward slashes work on every OS)
Osun <- read_rds("rds/Osun.rds")           # LGA boundaries
summary(Osun_sf)
row_id source lat_deg lon_deg
Min. : 49601 Length:4760 Min. :7.060 Min. :4.077
1st Qu.: 66875 Class :character 1st Qu.:7.513 1st Qu.:4.359
Median : 68245 Mode :character Median :7.706 Median :4.559
Mean : 68551 Mean :7.683 Mean :4.544
3rd Qu.: 69562 3rd Qu.:7.879 3rd Qu.:4.709
Max. :471319 Max. :8.062 Max. :5.055
report_date status_id water_source_clean water_source_category
Length:4760 Length:4760 Length:4760 Length:4760
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
water_tech_clean water_tech_category facility_type clean_country_name
Length:4760 Length:4760 Length:4760 Length:4760
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
clean_adm1 clean_adm2 clean_adm3 clean_adm4
Length:4760 Length:4760 Length:4760 Length:4760
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
install_year installer rehab_year rehabilitator
Min. :1917 Length:4760 Mode:logical Mode:logical
1st Qu.:2006 Class :character NA's:4760 NA's:4760
Median :2010 Mode :character
Mean :2009
3rd Qu.:2013
Max. :2015
NA's :1144
management_clean status_clean pay
Length:4760 Length:4760 Length:4760
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
fecal_coliform_presence fecal_coliform_value subjective_quality
Length:4760 Min. : NA Length:4760
Class :character 1st Qu.: NA Class :character
Mode :character Median : NA Mode :character
Mean :NaN
3rd Qu.: NA
Max. : NA
NA's :4760
activity_id scheme_id wpdx_id notes
Length:4760 Length:4760 Length:4760 Length:4760
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
orig_lnk photo_lnk country_id data_lnk
Length:4760 Length:4760 Length:4760 Length:4760
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
distance_to_primary_road distance_to_secondary_road distance_to_tertiary_road
Min. : 0.014 Min. : 0.152 Min. : 0.018
1st Qu.: 719.362 1st Qu.: 460.897 1st Qu.: 121.250
Median : 2972.784 Median : 2554.255 Median : 521.768
Mean : 5021.526 Mean : 3750.470 Mean : 1259.277
3rd Qu.: 7314.733 3rd Qu.: 5791.936 3rd Qu.: 1834.418
Max. :26909.862 Max. :19559.479 Max. :10966.271
distance_to_city distance_to_town water_point_history rehab_priority
Min. : 53.05 Min. : 30 Length:4760 Min. : 0.0
1st Qu.: 7930.75 1st Qu.: 6877 Class :character 1st Qu.: 7.0
Median :15030.41 Median :12205 Mode :character Median : 91.5
Mean :16663.99 Mean :16727 Mean : 489.3
3rd Qu.:24255.75 3rd Qu.:27739 3rd Qu.: 376.2
Max. :47934.34 Max. :44021 Max. :29697.0
NA's :2654
water_point_population local_population_1km crucialness_score
Min. : 0.0 Min. : 0 Min. :0.0001
1st Qu.: 14.0 1st Qu.: 176 1st Qu.:0.0655
Median : 119.0 Median : 1032 Median :0.1548
Mean : 513.6 Mean : 2727 Mean :0.2643
3rd Qu.: 433.2 3rd Qu.: 3717 3rd Qu.:0.3510
Max. :29697.0 Max. :36118 Max. :1.0000
NA's :4 NA's :4 NA's :798
pressure_score usage_capacity is_urban days_since_report
Min. : 0.0010 Min. : 300.0 Mode :logical Min. :1483
1st Qu.: 0.1160 1st Qu.: 300.0 FALSE:2884 1st Qu.:2688
Median : 0.4067 Median : 300.0 TRUE :1876 Median :2693
Mean : 1.4634 Mean : 560.7 Mean :2693
3rd Qu.: 1.2367 3rd Qu.:1000.0 3rd Qu.:2700
Max. :93.6900 Max. :1000.0 Max. :4645
NA's :798
staleness_score latest_record location_id cluster_size
Min. :23.13 Mode:logical Min. : 23741 Min. :1.000
1st Qu.:42.70 TRUE:4760 1st Qu.:230639 1st Qu.:1.000
Median :42.79 Median :236200 Median :1.000
Mean :42.80 Mean :235865 Mean :1.053
3rd Qu.:42.86 3rd Qu.:240061 3rd Qu.:1.000
Max. :62.66 Max. :267454 Max. :4.000
clean_country_id country_name water_source water_tech
Length:4760 Length:4760 Length:4760 Length:4760
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
status adm2 adm3 management
Mode :logical Length:4760 Length:4760 Length:4760
FALSE:2118 Class :character Class :character Class :character
TRUE :2642 Mode :character Mode :character Mode :character
adm1 New Georeferenced Column lat_deg_original
Length:4760 Length:4760 Min. : NA
Class :character Class :character 1st Qu.: NA
Mode :character Mode :character Median : NA
Mean :NaN
3rd Qu.: NA
Max. : NA
NA's :4760
lat_lon_deg lon_deg_original public_data_source converted
Length:4760 Min. : NA Length:4760 Length:4760
Class :character 1st Qu.: NA Class :character Class :character
Mode :character Median : NA Mode :character Mode :character
Mean :NaN
3rd Qu.: NA
Max. : NA
NA's :4760
count created_timestamp updated_timestamp Geometry
Min. :1 Length:4760 Length:4760 POINT :4760
1st Qu.:1 Class :character Class :character epsg:26392 : 0
Median :1 Mode :character Mode :character +proj=tmer...: 0
Mean :1
3rd Qu.:1
Max. :1
ADM2_EN ADM2_PCODE ADM1_EN ADM1_PCODE
Length:4760 Length:4760 Length:4760 Length:4760
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
We plot a bar chart to understand the distribution of the ‘status’ field of the Osun_sf data frame. Note that the status field takes only two values: TRUE and FALSE.
Osun_sf %>%
  freq(input = "status")
Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
of ggplot2 3.3.4.
ℹ The deprecated feature was likely used in the funModeling package.
Please report the issue at <https://github.com/pablo14/funModeling/issues>.

status frequency percentage cumulative_perc
1 TRUE 2642 55.5 55.5
2 FALSE 2118 44.5 100.0
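The same counts can be cross-checked without funModeling. The sketch below is a minimal base R example on a toy logical vector that merely reproduces the TRUE/FALSE counts reported above; the vector `status` is constructed for illustration and is not read from Osun_sf.

```r
# Toy vector that reproduces the reported counts (an assumption for
# illustration, not the real Osun_sf$status column).
status <- c(rep(TRUE, 2642), rep(FALSE, 2118))

counts <- table(status)                           # frequency of each value
percentage <- round(100 * prop.table(counts), 1)  # share of each value, in percent

print(counts)
print(percentage)
```

On the real data, replacing the toy vector with `Osun_sf$status` should give the same kind of summary.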
tmap_mode("view")
tmap mode set to interactive viewing
tm_shape(Osun) +
  tm_polygons(alpha = 0.4) +
  tm_shape(Osun_sf) +
  tm_dots(col = "status",
          alpha = 0.6) +
  tm_view(set.zoom.limits = c(9, 12))
tmap_mode("plot")
tmap mode set to plotting
4. Exploratory Data Analysis.
Here we use the skim() function to understand how the data are distributed in the Osun_sf data frame.
Here are some important observations:
- There are 4760 rows and 75 columns.
- Several fields have a large share of missing values: rehab_priority (about 56% missing), install_year (about 24%), and crucialness_score and pressure_score (about 17% each). We drop these variables because using them in a logistic regression model would mean either discarding many records or imputing a large share of the values.
Osun_sf %>%
  skim()
Warning: Couldn't find skimmers for class: sfc_POINT, sfc; No user-defined `sfl`
provided. Falling back to `character`.
| Name | Piped data |
|---|---|
| Number of rows | 4760 |
| Number of columns | 75 |
| Column type frequency: | |
| character | 47 |
| logical | 5 |
| numeric | 23 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| source | 0 | 1.00 | 5 | 44 | 0 | 2 | 0 |
| report_date | 0 | 1.00 | 22 | 22 | 0 | 42 | 0 |
| status_id | 0 | 1.00 | 2 | 7 | 0 | 3 | 0 |
| water_source_clean | 0 | 1.00 | 8 | 22 | 0 | 3 | 0 |
| water_source_category | 0 | 1.00 | 4 | 6 | 0 | 2 | 0 |
| water_tech_clean | 24 | 0.99 | 9 | 23 | 0 | 3 | 0 |
| water_tech_category | 24 | 0.99 | 9 | 15 | 0 | 2 | 0 |
| facility_type | 0 | 1.00 | 8 | 8 | 0 | 1 | 0 |
| clean_country_name | 0 | 1.00 | 7 | 7 | 0 | 1 | 0 |
| clean_adm1 | 0 | 1.00 | 3 | 5 | 0 | 5 | 0 |
| clean_adm2 | 0 | 1.00 | 3 | 14 | 0 | 35 | 0 |
| clean_adm3 | 4760 | 0.00 | NA | NA | 0 | 0 | 0 |
| clean_adm4 | 4760 | 0.00 | NA | NA | 0 | 0 | 0 |
| installer | 4760 | 0.00 | NA | NA | 0 | 0 | 0 |
| management_clean | 1573 | 0.67 | 5 | 37 | 0 | 7 | 0 |
| status_clean | 0 | 1.00 | 9 | 32 | 0 | 7 | 0 |
| pay | 0 | 1.00 | 2 | 39 | 0 | 7 | 0 |
| fecal_coliform_presence | 4760 | 0.00 | NA | NA | 0 | 0 | 0 |
| subjective_quality | 0 | 1.00 | 18 | 20 | 0 | 4 | 0 |
| activity_id | 4757 | 0.00 | 36 | 36 | 0 | 3 | 0 |
| scheme_id | 4760 | 0.00 | NA | NA | 0 | 0 | 0 |
| wpdx_id | 0 | 1.00 | 12 | 12 | 0 | 4760 | 0 |
| notes | 0 | 1.00 | 2 | 96 | 0 | 3502 | 0 |
| orig_lnk | 4757 | 0.00 | 84 | 84 | 0 | 1 | 0 |
| photo_lnk | 41 | 0.99 | 84 | 84 | 0 | 4719 | 0 |
| country_id | 0 | 1.00 | 2 | 2 | 0 | 1 | 0 |
| data_lnk | 0 | 1.00 | 79 | 96 | 0 | 2 | 0 |
| water_point_history | 0 | 1.00 | 142 | 834 | 0 | 4750 | 0 |
| clean_country_id | 0 | 1.00 | 3 | 3 | 0 | 1 | 0 |
| country_name | 0 | 1.00 | 7 | 7 | 0 | 1 | 0 |
| water_source | 0 | 1.00 | 8 | 30 | 0 | 4 | 0 |
| water_tech | 0 | 1.00 | 5 | 37 | 0 | 20 | 0 |
| adm2 | 0 | 1.00 | 3 | 14 | 0 | 33 | 0 |
| adm3 | 4760 | 0.00 | NA | NA | 0 | 0 | 0 |
| management | 1573 | 0.67 | 5 | 47 | 0 | 7 | 0 |
| adm1 | 0 | 1.00 | 4 | 5 | 0 | 4 | 0 |
| New Georeferenced Column | 0 | 1.00 | 16 | 35 | 0 | 4760 | 0 |
| lat_lon_deg | 0 | 1.00 | 13 | 32 | 0 | 4760 | 0 |
| public_data_source | 0 | 1.00 | 84 | 102 | 0 | 2 | 0 |
| converted | 0 | 1.00 | 53 | 53 | 0 | 1 | 0 |
| created_timestamp | 0 | 1.00 | 22 | 22 | 0 | 2 | 0 |
| updated_timestamp | 0 | 1.00 | 22 | 22 | 0 | 2 | 0 |
| Geometry | 0 | 1.00 | 33 | 37 | 0 | 4760 | 0 |
| ADM2_EN | 0 | 1.00 | 3 | 14 | 0 | 30 | 0 |
| ADM2_PCODE | 0 | 1.00 | 8 | 8 | 0 | 30 | 0 |
| ADM1_EN | 0 | 1.00 | 4 | 4 | 0 | 1 | 0 |
| ADM1_PCODE | 0 | 1.00 | 5 | 5 | 0 | 1 | 0 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| rehab_year | 4760 | 0 | NaN | : |
| rehabilitator | 4760 | 0 | NaN | : |
| is_urban | 0 | 1 | 0.39 | FAL: 2884, TRU: 1876 |
| latest_record | 0 | 1 | 1.00 | TRU: 4760 |
| status | 0 | 1 | 0.56 | TRU: 2642, FAL: 2118 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| row_id | 0 | 1.00 | 68550.48 | 10216.94 | 49601.00 | 66874.75 | 68244.50 | 69562.25 | 471319.00 | ▇▁▁▁▁ |
| lat_deg | 0 | 1.00 | 7.68 | 0.22 | 7.06 | 7.51 | 7.71 | 7.88 | 8.06 | ▁▂▇▇▇ |
| lon_deg | 0 | 1.00 | 4.54 | 0.21 | 4.08 | 4.36 | 4.56 | 4.71 | 5.06 | ▃▆▇▇▂ |
| install_year | 1144 | 0.76 | 2008.63 | 6.04 | 1917.00 | 2006.00 | 2010.00 | 2013.00 | 2015.00 | ▁▁▁▁▇ |
| fecal_coliform_value | 4760 | 0.00 | NaN | NA | NA | NA | NA | NA | NA | |
| distance_to_primary_road | 0 | 1.00 | 5021.53 | 5648.34 | 0.01 | 719.36 | 2972.78 | 7314.73 | 26909.86 | ▇▂▁▁▁ |
| distance_to_secondary_road | 0 | 1.00 | 3750.47 | 3938.63 | 0.15 | 460.90 | 2554.25 | 5791.94 | 19559.48 | ▇▃▁▁▁ |
| distance_to_tertiary_road | 0 | 1.00 | 1259.28 | 1680.04 | 0.02 | 121.25 | 521.77 | 1834.42 | 10966.27 | ▇▂▁▁▁ |
| distance_to_city | 0 | 1.00 | 16663.99 | 10960.82 | 53.05 | 7930.75 | 15030.41 | 24255.75 | 47934.34 | ▇▇▆▃▁ |
| distance_to_town | 0 | 1.00 | 16726.59 | 12452.65 | 30.00 | 6876.92 | 12204.53 | 27739.46 | 44020.64 | ▇▅▃▃▂ |
| rehab_priority | 2654 | 0.44 | 489.33 | 1658.81 | 0.00 | 7.00 | 91.50 | 376.25 | 29697.00 | ▇▁▁▁▁ |
| water_point_population | 4 | 1.00 | 513.58 | 1458.92 | 0.00 | 14.00 | 119.00 | 433.25 | 29697.00 | ▇▁▁▁▁ |
| local_population_1km | 4 | 1.00 | 2727.16 | 4189.46 | 0.00 | 176.00 | 1032.00 | 3717.00 | 36118.00 | ▇▁▁▁▁ |
| crucialness_score | 798 | 0.83 | 0.26 | 0.28 | 0.00 | 0.07 | 0.15 | 0.35 | 1.00 | ▇▃▁▁▁ |
| pressure_score | 798 | 0.83 | 1.46 | 4.16 | 0.00 | 0.12 | 0.41 | 1.24 | 93.69 | ▇▁▁▁▁ |
| usage_capacity | 0 | 1.00 | 560.74 | 338.46 | 300.00 | 300.00 | 300.00 | 1000.00 | 1000.00 | ▇▁▁▁▅ |
| days_since_report | 0 | 1.00 | 2692.69 | 41.92 | 1483.00 | 2688.00 | 2693.00 | 2700.00 | 4645.00 | ▁▇▁▁▁ |
| staleness_score | 0 | 1.00 | 42.80 | 0.58 | 23.13 | 42.70 | 42.79 | 42.86 | 62.66 | ▁▁▇▁▁ |
| location_id | 0 | 1.00 | 235865.49 | 6657.60 | 23741.00 | 230638.75 | 236199.50 | 240061.25 | 267454.00 | ▁▁▁▁▇ |
| cluster_size | 0 | 1.00 | 1.05 | 0.25 | 1.00 | 1.00 | 1.00 | 1.00 | 4.00 | ▇▁▁▁▁ |
| lat_deg_original | 4760 | 0.00 | NaN | NA | NA | NA | NA | NA | NA | |
| lon_deg_original | 4760 | 0.00 | NaN | NA | NA | NA | NA | NA | NA | |
| count | 0 | 1.00 | 1.00 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | ▁▁▇▁▁ |
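The n_missing and complete_rate columns above can also be turned into an explicit screening rule. The following is a minimal base R sketch on a hypothetical toy data frame; the column values and the 20% threshold are assumptions chosen for illustration, not values taken from Osun_sf.

```r
# Toy data frame (hypothetical values) with different amounts of missing data.
toy <- data.frame(
  install_year   = c(2006, NA, NA, 2013, 2015),  # 2 of 5 missing
  rehab_priority = c(NA, NA, NA, 7, 91),         # 3 of 5 missing
  usage_capacity = c(300, 300, 1000, 1000, 300)  # complete
)

# Share of missing values per column (equals 1 - complete_rate in skim()).
missing_share <- colMeans(is.na(toy))

# Columns whose missing share reaches the assumed 20% threshold.
threshold <- 0.2
high_missing <- names(missing_share)[missing_share >= threshold]

print(round(missing_share, 2))
print(high_missing)
```

Applied to the real data after dropping the geometry column, the same two lines would list the candidate variables to exclude.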
We create a clean data frame using the following code chunk. Note that we exclude records with missing values in the model variables and convert usage_capacity to a factor.
Osun_wp_sf_clean <- Osun_sf %>%
  filter_at(vars(status,
                 distance_to_primary_road,
                 distance_to_secondary_road,
                 distance_to_tertiary_road,
                 distance_to_city,
                 distance_to_town,
                 water_point_population,
                 local_population_1km,
                 usage_capacity,
                 is_urban,
                 water_source_clean),
            all_vars(!is.na(.))) %>%
  mutate(usage_capacity = as.factor(usage_capacity))
- Note that Osun_wp_sf_clean contains 4 fewer records than Osun_sf (4756 instead of 4760), consistent with the 4 missing values reported for water_point_population and local_population_1km.
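The cleaning step can be illustrated in base R as well. This is a minimal sketch on a hypothetical toy data frame (the values and the `notes` column are made up for illustration); it uses `complete.cases()` rather than the dplyr verbs used in the exercise, but it applies the same idea: drop rows that have a missing value in any model variable, then convert usage_capacity to a factor.

```r
# Toy data frame (hypothetical values).
toy <- data.frame(
  status                 = c(TRUE, FALSE, TRUE, NA),
  water_point_population = c(10, NA, 30, 40),
  usage_capacity         = c(300, 1000, 1000, 300),
  notes                  = c(NA, "a", "b", "c")  # not a model variable
)

# Only missing values in the model variables should remove a row.
model_vars <- c("status", "water_point_population", "usage_capacity")
keep <- complete.cases(toy[, model_vars])

toy_clean <- toy[keep, ]
toy_clean$usage_capacity <- as.factor(toy_clean$usage_capacity)

print(toy_clean)
```

Row 1 is kept even though `notes` is missing, because `notes` is not among the model variables.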
summary(Osun_wp_sf_clean)
row_id source lat_deg lon_deg
Min. : 49601 Length:4756 Min. :7.060 Min. :4.077
1st Qu.: 66876 Class :character 1st Qu.:7.513 1st Qu.:4.359
Median : 68245 Mode :character Median :7.706 Median :4.559
Mean : 68551 Mean :7.683 Mean :4.544
3rd Qu.: 69562 3rd Qu.:7.879 3rd Qu.:4.709
Max. :471319 Max. :8.062 Max. :5.055
report_date status_id water_source_clean water_source_category
Length:4756 Length:4756 Length:4756 Length:4756
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
water_tech_clean water_tech_category facility_type clean_country_name
Length:4756 Length:4756 Length:4756 Length:4756
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
clean_adm1 clean_adm2 clean_adm3 clean_adm4
Length:4756 Length:4756 Length:4756 Length:4756
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
install_year installer rehab_year rehabilitator
Min. :1917 Length:4756 Mode:logical Mode:logical
1st Qu.:2006 Class :character NA's:4756 NA's:4756
Median :2010 Mode :character
Mean :2009
3rd Qu.:2013
Max. :2015
NA's :1143
management_clean status_clean pay
Length:4756 Length:4756 Length:4756
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
fecal_coliform_presence fecal_coliform_value subjective_quality
Length:4756 Min. : NA Length:4756
Class :character 1st Qu.: NA Class :character
Mode :character Median : NA Mode :character
Mean :NaN
3rd Qu.: NA
Max. : NA
NA's :4756
activity_id scheme_id wpdx_id notes
Length:4756 Length:4756 Length:4756 Length:4756
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
orig_lnk photo_lnk country_id data_lnk
Length:4756 Length:4756 Length:4756 Length:4756
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
distance_to_primary_road distance_to_secondary_road distance_to_tertiary_road
Min. : 0.014 Min. : 0.152 Min. : 0.018
1st Qu.: 719.362 1st Qu.: 460.503 1st Qu.: 121.334
Median : 2968.379 Median : 2554.255 Median : 521.768
Mean : 5021.729 Mean : 3751.000 Mean : 1259.650
3rd Qu.: 7314.733 3rd Qu.: 5791.936 3rd Qu.: 1834.418
Max. :26909.862 Max. :19559.479 Max. :10966.271
distance_to_city distance_to_town water_point_history rehab_priority
Min. : 53.05 Min. : 30 Length:4756 Min. : 0.0
1st Qu.: 7930.75 1st Qu.: 6877 Class :character 1st Qu.: 7.0
Median :15020.40 Median :12215 Mode :character Median : 91.5
Mean :16662.78 Mean :16732 Mean : 489.3
3rd Qu.:24255.75 3rd Qu.:27746 3rd Qu.: 376.2
Max. :47934.34 Max. :44021 Max. :29697.0
NA's :2650
water_point_population local_population_1km crucialness_score
Min. : 0.0 Min. : 0 Min. :0.0001
1st Qu.: 14.0 1st Qu.: 176 1st Qu.:0.0655
Median : 119.0 Median : 1032 Median :0.1548
Mean : 513.6 Mean : 2727 Mean :0.2643
3rd Qu.: 433.2 3rd Qu.: 3717 3rd Qu.:0.3510
Max. :29697.0 Max. :36118 Max. :1.0000
NA's :794
pressure_score usage_capacity is_urban days_since_report
Min. : 0.0010 300 :2986 Mode :logical Min. :1483
1st Qu.: 0.1160 1000:1770 FALSE:2882 1st Qu.:2688
Median : 0.4067 TRUE :1874 Median :2693
Mean : 1.4634 Mean :2693
3rd Qu.: 1.2367 3rd Qu.:2700
Max. :93.6900 Max. :4645
NA's :794
staleness_score latest_record location_id cluster_size
Min. :23.13 Mode:logical Min. : 23741 Min. :1.000
1st Qu.:42.70 TRUE:4756 1st Qu.:230639 1st Qu.:1.000
Median :42.79 Median :236199 Median :1.000
Mean :42.80 Mean :235865 Mean :1.053
3rd Qu.:42.86 3rd Qu.:240062 3rd Qu.:1.000
Max. :62.66 Max. :267454 Max. :4.000
clean_country_id country_name water_source water_tech
Length:4756 Length:4756 Length:4756 Length:4756
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
status adm2 adm3 management
Mode :logical Length:4756 Length:4756 Length:4756
FALSE:2114 Class :character Class :character Class :character
TRUE :2642 Mode :character Mode :character Mode :character
adm1 New Georeferenced Column lat_deg_original
Length:4756 Length:4756 Min. : NA
Class :character Class :character 1st Qu.: NA
Mode :character Mode :character Median : NA
Mean :NaN
3rd Qu.: NA
Max. : NA
NA's :4756
lat_lon_deg lon_deg_original public_data_source converted
Length:4756 Min. : NA Length:4756 Length:4756
Class :character 1st Qu.: NA Class :character Class :character
Mode :character Median : NA Mode :character Mode :character
Mean :NaN
3rd Qu.: NA
Max. : NA
NA's :4756
count created_timestamp updated_timestamp Geometry
Min. :1 Length:4756 Length:4756 POINT :4756
1st Qu.:1 Class :character Class :character epsg:26392 : 0
Median :1 Mode :character Mode :character +proj=tmer...: 0
Mean :1
3rd Qu.:1
Max. :1
ADM2_EN ADM2_PCODE ADM1_EN ADM1_PCODE
Length:4756 Length:4756 Length:4756 Length:4756
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
5. Correlation Analysis.
Osun_wp <- Osun_wp_sf_clean %>%
  select(c(7, 35:39, 42:43, 46:47, 57)) %>%
  st_set_geometry(NULL) # Drop geometry
5.1 Correlation Matrix.
cluster_vars.cor = cor(Osun_wp[,2:7])
corrplot.mixed(cluster_vars.cor,tl.cex = 0.7,
lower = "ellipse", number.cex = 0.6,
upper = "number",
tl.pos = "lt",
diag = "l",
tl.col = "black")
Observation -
We observe that none of the variables are highly correlated. As a rule of thumb, we treat an absolute correlation coefficient of 0.8 or more as high, and we recommend that two such highly correlated variables not be included together in the regression model.
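The 0.8 rule of thumb can also be checked programmatically rather than by reading the plot. The sketch below is a minimal base R example on simulated data; the variable names and values are hypothetical, and `dist_town` is deliberately built as a near copy of `dist_city` so that one pair exceeds the threshold.

```r
set.seed(1)
n <- 200
dist_city <- runif(n, 0, 50000)

# Simulated predictors (hypothetical, for illustration only).
toy <- data.frame(
  dist_city  = dist_city,
  dist_town  = dist_city + rnorm(n, sd = 1000),  # nearly a copy of dist_city
  population = rpois(n, 500)                     # unrelated to the distances
)

cor_mat <- cor(toy)

# Look only at the upper triangle so each pair is reported once.
high <- which(abs(cor_mat) >= 0.8 & upper.tri(cor_mat), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cor_mat)[high[, "row"]],
  var2 = colnames(cor_mat)[high[, "col"]],
  r    = cor_mat[high]
)

print(high_pairs)
```

An empty `high_pairs` would correspond to the situation described in the observation above, where no pair of predictors is highly correlated.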
6. Perform Logistic Regression.
In the code chunk below, we use the glm() function of R to build a logistic regression model for the water point status.
model <- glm(status ~ distance_to_primary_road +
               distance_to_secondary_road +
               distance_to_tertiary_road +
               distance_to_city +
               distance_to_town +
               is_urban +
               usage_capacity +
               water_source_clean +
               water_point_population +
               local_population_1km,
             data = Osun_wp_sf_clean,
             family = binomial(link = 'logit'))
Here we use the blorr package to generate the report.
blr_regress(model)
Model Overview
------------------------------------------------------------------------
Data Set Resp Var Obs. Df. Model Df. Residual Convergence
------------------------------------------------------------------------
data status 4756 4755 4744 TRUE
------------------------------------------------------------------------
Response Summary
--------------------------------------------------------
Outcome Frequency Outcome Frequency
--------------------------------------------------------
0 2114 1 2642
--------------------------------------------------------
Maximum Likelihood Estimates
-----------------------------------------------------------------------------------------------
Parameter DF Estimate Std. Error z value Pr(>|z|)
-----------------------------------------------------------------------------------------------
(Intercept) 1 0.3887 0.1124 3.4588 5e-04
distance_to_primary_road 1 0.0000 0.0000 -0.7153 0.4744
distance_to_secondary_road 1 0.0000 0.0000 -0.5530 0.5802
distance_to_tertiary_road 1 1e-04 0.0000 4.6708 0.0000
distance_to_city 1 0.0000 0.0000 -4.7574 0.0000
distance_to_town 1 0.0000 0.0000 -4.9170 0.0000
is_urbanTRUE 1 -0.2971 0.0819 -3.6294 3e-04
usage_capacity1000 1 -0.6230 0.0697 -8.9366 0.0000
water_source_cleanProtected Shallow Well 1 0.5040 0.0857 5.8783 0.0000
water_source_cleanProtected Spring 1 1.2882 0.4388 2.9359 0.0033
water_point_population 1 -5e-04 0.0000 -11.3686 0.0000
local_population_1km 1 3e-04 0.0000 19.2953 0.0000
-----------------------------------------------------------------------------------------------
Association of Predicted Probabilities and Observed Responses
---------------------------------------------------------------
% Concordant 0.7347 Somers' D 0.4693
% Discordant 0.2653 Gamma 0.4693
% Tied 0.0000 Tau-a 0.2318
Pairs 5585188 c 0.7347
---------------------------------------------------------------
6.1 Interpretation of the report.
The Response Summary tells us that 2114 records belong to class 0 (status = FALSE) and 2642 records belong to class 1 (status = TRUE).
At the 5% significance level, variables with a p-value less than 0.05 are statistically significant. These are all independent variables except distance_to_primary_road and distance_to_secondary_road.
In the Maximum Likelihood Estimates table, the ‘Estimate’ column gives the logistic regression coefficient of each term, that is, the change in the log-odds of status = TRUE associated with that term when the other variables are held constant. It is not a correlation coefficient and is not restricted to the range -1 to +1.
For categorical variables (is_urban, usage_capacity, water_source_clean), each estimate compares one level with the reference level. For example, the estimate 1.2882 for ‘water_source_cleanProtected Spring’ is the difference in log-odds between a Protected Spring and the reference category, Borehole.
water_point_population, local_population_1km and the distance variables are continuous. For these, a positive estimate means that the log-odds of status = TRUE increase as the variable increases, and a negative estimate means that they decrease. The size of an estimate depends on the unit of the variable, so a small value such as -5e-04 per person does not by itself indicate a weak effect.
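Estimates from a logistic regression are on the log-odds scale, so exponentiating them gives odds ratios, which are often easier to read. The following minimal sketch copies four estimates from the blr_regress() report above into a named vector (the vector itself is typed in by hand for illustration, not extracted from the model object) and converts them.

```r
# Log-odds estimates copied by hand from the report above.
log_odds <- c(
  is_urbanTRUE                               = -0.2971,
  usage_capacity1000                         = -0.6230,
  "water_source_cleanProtected Shallow Well" =  0.5040,
  "water_source_cleanProtected Spring"       =  1.2882
)

# exp() of a log-odds coefficient is the odds ratio relative to the
# reference level (or per one-unit increase for a continuous variable).
odds_ratio <- exp(log_odds)

print(round(odds_ratio, 2))
```

An odds ratio above 1 means higher odds of status = TRUE than for the reference level, and a value below 1 means lower odds. With a fitted model object, `exp(coef(model))` performs the same conversion for all terms.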
6.2 Confusion Matrix.
blr_confusion_matrix(model, cutoff = 0.5)
Confusion Matrix and Statistics
Reference
Prediction FALSE TRUE
0 1301 738
1 813 1904
Accuracy : 0.6739
No Information Rate : 0.4445
Kappa : 0.3373
McNemars's Test P-Value : 0.0602
Sensitivity : 0.7207
Specificity : 0.6154
Pos Pred Value : 0.7008
Neg Pred Value : 0.6381
Prevalence : 0.5555
Detection Rate : 0.4003
Detection Prevalence : 0.5713
Balanced Accuracy : 0.6680
Precision : 0.7008
Recall : 0.7207
'Positive' Class : 1
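To make the role of the cutoff concrete, here is a minimal base R sketch of how a confusion matrix is obtained from predicted probabilities. It fits a logistic regression to simulated data (the variables `x` and `y` and the true coefficients are assumptions for illustration) and does not use blorr or the Osun data.

```r
set.seed(42)
n <- 500
x <- rnorm(n)
# Simulated binary outcome whose log-odds depend linearly on x.
y <- rbinom(n, size = 1, prob = plogis(0.3 + 1.2 * x))

toy_model <- glm(y ~ x, family = binomial(link = "logit"))

# Predicted probabilities, then hard classes using a 0.5 cutoff.
prob <- predict(toy_model, type = "response")
pred <- as.integer(prob >= 0.5)

# Rows are predictions and columns are observed values, as in the report above.
conf_mat <- table(
  Prediction = factor(pred, levels = c(0, 1)),
  Reference  = factor(y, levels = c(0, 1))
)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

print(conf_mat)
print(accuracy)
```

Raising the cutoff moves cases from the predicted 1 row to the predicted 0 row, which typically trades sensitivity for specificity.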
6.3 Interpretation of Confusion Matrix.
To assess the overall performance of a logistic regression model, we often refer to the misclassification rate. The classification table above shows 738 false negatives (predicted 0, actual TRUE) and 813 false positives (predicted 1, actual FALSE). The overall misclassification rate is (738 + 813) / 4756 = 32.61%.
Equivalently, the model predicts 100 - 32.61 = 67.39% of the water point statuses correctly, which is the accuracy of the model.
Let us now look at the true positive rate and the true negative rate. See the following figure for reference.
Sensitivity is also known as the true positive rate or recall. It answers the question, “If an event really is positive, what is the probability that the model predicts it as positive?” Our model has Sensitivity = 1904 / (1904 + 738) = 72.07%.
Specificity is the true negative rate. It answers the question, “If an event really is negative, what is the probability that the model predicts it as negative?” Our model has Specificity = 1301 / (1301 + 813) = 61.54%.

[Figure: Metrics]
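The headline metrics can be recomputed from the four cells of the confusion matrix. The minimal sketch below uses the counts from the blr_confusion_matrix() output above; the variable names are chosen for illustration.

```r
# Cell counts from the confusion matrix above (positive class = 1).
tn <- 1301  # predicted 0, actual FALSE
fn <- 738   # predicted 0, actual TRUE
fp <- 813   # predicted 1, actual FALSE
tp <- 1904  # predicted 1, actual TRUE

total <- tn + fn + fp + tp

metrics <- c(
  accuracy          = (tp + tn) / total,
  misclassification = (fp + fn) / total,
  sensitivity       = tp / (tp + fn),  # true positive rate, recall
  specificity       = tn / (tn + fp),  # true negative rate
  precision         = tp / (tp + fp),  # positive predictive value
  neg_pred_value    = tn / (tn + fn)
)

print(round(metrics, 4))
```

Note that sensitivity and specificity condition on the actual class, whereas precision and negative predictive value condition on the predicted class.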
7. How can we improve performance?
Although our results are encouraging for a first attempt, there is still a lot of scope for improvement. As a first step, let us convert the simple feature data frame into a SpatialPointsDataFrame (the sp class), which is the input format used with the GWmodel functions below.
Osun_wp_sp <- Osun_wp_sf_clean %>%
  select(c(status,
           distance_to_primary_road,
           distance_to_secondary_road,
           distance_to_tertiary_road,
           distance_to_city,
           distance_to_town,
           water_point_population,
           local_population_1km,
           is_urban,
           usage_capacity,
           water_source_clean)) %>%
  as_Spatial()
Osun_wp_sp
class : SpatialPointsDataFrame
features : 4756
extent : 182502.4, 290751, 340054.1, 450905.3 (xmin, xmax, ymin, ymax)
crs : +proj=tmerc +lat_0=4 +lon_0=8.5 +k=0.99975 +x_0=670553.98 +y_0=0 +a=6378249.145 +rf=293.465 +towgs84=-92,-93,122,0,0,0,0 +units=m +no_defs
variables : 11
names : status, distance_to_primary_road, distance_to_secondary_road, distance_to_tertiary_road, distance_to_city, distance_to_town, water_point_population, local_population_1km, is_urban, usage_capacity, water_source_clean
min values : 0, 0.014461356813335, 0.152195902540837, 0.017815121653488, 53.0461399623541, 30.0019777713073, 0, 0, 0, 1000, Borehole
max values : 1, 26909.8616132094, 19559.4793799085, 10966.2705628969, 47934.343603562, 44020.6393368124, 29697, 36118, 1, 300, Protected Spring
Important note: Osun_wp_sp has 4756 records, 4 fewer than the original 4760, because it is built from Osun_wp_sf_clean.
8. Calculate Distance Matrix - Fixed Bandwidth.
bw.fixed <- bw.ggwr(status ~ distance_to_primary_road +
                      distance_to_secondary_road +
                      distance_to_tertiary_road +
                      distance_to_city +
                      distance_to_town +
                      water_point_population +
                      local_population_1km +
                      is_urban +
                      usage_capacity +
                      water_source_clean,
                    data = Osun_wp_sp,
                    family = "binomial",
                    approach = "AIC",
                    kernel = "gaussian",
                    adaptive = FALSE, # for fixed bandwidth
                    longlat = FALSE)  # input data have been converted to a projected CRS
Take a cup of tea and have a break, it will take a few minutes.
-----A kind suggestion from GWmodel development group
Iteration Log-Likelihood:(With bandwidth: 95768.67 )
=========================
0 -2889
1 -2836
2 -2830
3 -2829
4 -2829
5 -2829
Fixed bandwidth: 95768.67 AICc value: 5684.357
Iteration Log-Likelihood:(With bandwidth: 59200.13 )
=========================
0 -2875
1 -2818
2 -2810
3 -2808
4 -2808
5 -2808
Fixed bandwidth: 59200.13 AICc value: 5646.785
Iteration Log-Likelihood:(With bandwidth: 36599.53 )
=========================
0 -2847
1 -2781
2 -2768
3 -2765
4 -2765
5 -2765
6 -2765
Fixed bandwidth: 36599.53 AICc value: 5575.148
Iteration Log-Likelihood:(With bandwidth: 22631.59 )
=========================
0 -2798
1 -2719
2 -2698
3 -2693
4 -2693
5 -2693
6 -2693
Fixed bandwidth: 22631.59 AICc value: 5466.883
Iteration Log-Likelihood:(With bandwidth: 13998.93 )
=========================
0 -2720
1 -2622
2 -2590
3 -2581
4 -2580
5 -2580
6 -2580
7 -2580
Fixed bandwidth: 13998.93 AICc value: 5324.578
Iteration Log-Likelihood:(With bandwidth: 8663.649 )
=========================
0 -2601
1 -2476
2 -2431
3 -2419
4 -2417
5 -2417
6 -2417
7 -2417
Fixed bandwidth: 8663.649 AICc value: 5163.61
Iteration Log-Likelihood:(With bandwidth: 5366.266 )
=========================
0 -2436
1 -2268
2 -2194
3 -2167
4 -2161
5 -2161
6 -2161
7 -2161
8 -2161
9 -2161
Fixed bandwidth: 5366.266 AICc value: 4990.587
Iteration Log-Likelihood:(With bandwidth: 3328.371 )
=========================
0 -2157
1 -1922
2 -1802
3 -1739
4 -1713
5 -1713
Fixed bandwidth: 3328.371 AICc value: 4798.288
Iteration Log-Likelihood:(With bandwidth: 2068.882 )
=========================
0 -1751
1 -1421
2 -1238
3 -1133
4 -1084
5 -1084
Fixed bandwidth: 2068.882 AICc value: 4837.017
Iteration Log-Likelihood:(With bandwidth: 4106.777 )
=========================
0 -2297
1 -2095
2 -1997
3 -1951
4 -1938
5 -1936
6 -1936
7 -1936
8 -1936
Fixed bandwidth: 4106.777 AICc value: 4873.161
Iteration Log-Likelihood:(With bandwidth: 2847.289 )
=========================
0 -2036
1 -1771
2 -1633
3 -1558
4 -1525
5 -1525
Fixed bandwidth: 2847.289 AICc value: 4768.192
Iteration Log-Likelihood:(With bandwidth: 2549.964 )
=========================
0 -1941
1 -1655
2 -1503
3 -1417
4 -1378
5 -1378
Fixed bandwidth: 2549.964 AICc value: 4762.212
Iteration Log-Likelihood:(With bandwidth: 2366.207 )
=========================
0 -1874
1 -1573
2 -1410
3 -1316
4 -1274
5 -1274
Fixed bandwidth: 2366.207 AICc value: 4773.081
Iteration Log-Likelihood:(With bandwidth: 2663.532 )
=========================
0 -1979
1 -1702
2 -1555
3 -1474
4 -1438
5 -1438
Fixed bandwidth: 2663.532 AICc value: 4762.568
Iteration Log-Likelihood:(With bandwidth: 2479.775 )
=========================
0 -1917
1 -1625
2 -1468
3 -1380
4 -1339
5 -1339
Fixed bandwidth: 2479.775 AICc value: 4764.294
Iteration Log-Likelihood:(With bandwidth: 2593.343 )
=========================
0 -1956
1 -1674
2 -1523
3 -1439
4 -1401
5 -1401
Fixed bandwidth: 2593.343 AICc value: 4761.813
Iteration Log-Likelihood:(With bandwidth: 2620.153 )
=========================
0 -1965
1 -1685
2 -1536
3 -1453
4 -1415
5 -1415
Fixed bandwidth: 2620.153 AICc value: 4761.89
Iteration Log-Likelihood:(With bandwidth: 2576.774 )
=========================
0 -1950
1 -1667
2 -1515
3 -1431
4 -1393
5 -1393
Fixed bandwidth: 2576.774 AICc value: 4761.889
Iteration Log-Likelihood:(With bandwidth: 2603.584 )
=========================
0 -1960
1 -1678
2 -1528
3 -1445
4 -1407
5 -1407
Fixed bandwidth: 2603.584 AICc value: 4761.813
Iteration Log-Likelihood:(With bandwidth: 2609.913 )
=========================
0 -1962
1 -1680
2 -1531
3 -1448
4 -1410
5 -1410
Fixed bandwidth: 2609.913 AICc value: 4761.831
Iteration Log-Likelihood:(With bandwidth: 2599.672 )
=========================
0 -1958
1 -1676
2 -1526
3 -1443
4 -1405
5 -1405
Fixed bandwidth: 2599.672 AICc value: 4761.809
Iteration Log-Likelihood:(With bandwidth: 2597.255 )
=========================
0 -1957
1 -1675
2 -1525
3 -1441
4 -1403
5 -1403
Fixed bandwidth: 2597.255 AICc value: 4761.809
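In the first steps of the log above, each trial bandwidth is about 0.618 times the previous one (for example, 59200.13 / 95768.67 ≈ 0.618), which is consistent with a golden-section search for the bandwidth that minimizes AICc. The sketch below is a generic golden-section minimizer in base R, written for illustration only: it is not GWmodel's code, and `toy_score` is a hypothetical stand-in for the AICc of a fitted model.

```r
# Generic golden-section search for the minimum of a unimodal function f
# on the interval [lower, upper]. Illustrative only, not GWmodel's code.
golden_section_min <- function(f, lower, upper, tol = 1e-6) {
  ratio <- (sqrt(5) - 1) / 2  # about 0.618
  x1 <- upper - ratio * (upper - lower)
  x2 <- lower + ratio * (upper - lower)
  f1 <- f(x1)
  f2 <- f(x2)
  while (abs(upper - lower) > tol) {
    if (f1 < f2) {
      # Minimum lies in [lower, x2]: shrink from the right, reuse x1 as new x2.
      upper <- x2
      x2 <- x1
      f2 <- f1
      x1 <- upper - ratio * (upper - lower)
      f1 <- f(x1)
    } else {
      # Minimum lies in [x1, upper]: shrink from the left, reuse x2 as new x1.
      lower <- x1
      x1 <- x2
      f1 <- f2
      x2 <- lower + ratio * (upper - lower)
      f2 <- f(x2)
    }
  }
  (lower + upper) / 2
}

# Hypothetical score with its minimum at a bandwidth of 2600.
toy_score <- function(bw) (bw - 2600)^2

best_bw <- golden_section_min(toy_score, lower = 0, upper = 100000, tol = 1e-3)
print(best_bw)
```

Each iteration evaluates the score at only one new point, which matters when every evaluation means refitting a geographically weighted model.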
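- The search settles on a fixed bandwidth of 2599.672 (in metres, the unit of the projected CRS), where the corrected Akaike Information Criterion (AICc) is 4761.809.
bw.fixed
[1] 2599.672
We feed this bandwidth into the bw argument of ggwr.basic() of GWmodel in the code chunk below. Note that the formula in this call omits distance_to_tertiary_road, which was included in the bandwidth search above.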
gwlr.fixed <- ggwr.basic(status ~ distance_to_primary_road +
                           distance_to_secondary_road +
                           distance_to_city +
                           distance_to_town +
                           water_point_population +
                           local_population_1km +
                           is_urban +
                           usage_capacity +
                           water_source_clean,
                         data = Osun_wp_sp,
                         bw = bw.fixed,
                         family = "binomial",
                         kernel = "gaussian",
                         adaptive = FALSE,
                         longlat = FALSE)
Iteration Log-Likelihood
=========================
0 -2009
1 -1738
2 -1595
3 -1518
4 -1486
5 -1486
gwlr.fixed
***********************************************************************
* Package GWmodel *
***********************************************************************
Program starts at: 2022-12-19 07:10:45
Call:
ggwr.basic(formula = status ~ distance_to_primary_road + distance_to_secondary_road +
distance_to_city + distance_to_town + water_point_population +
local_population_1km + is_urban + usage_capacity + water_source_clean,
data = Osun_wp_sp, bw = bw.fixed, family = "binomial", kernel = "gaussian",
adaptive = FALSE, longlat = FALSE)
Dependent (y) variable: status
Independent variables: distance_to_primary_road distance_to_secondary_road distance_to_city distance_to_town water_point_population local_population_1km is_urban usage_capacity water_source_clean
Number of data points: 4756
Used family: binomial
***********************************************************************
* Results of Generalized linear Regression *
***********************************************************************
Call:
NULL
Deviance Residuals:
Min 1Q Median 3Q Max
-115.519 -1.765 1.074 1.756 37.016
Coefficients:
Estimate Std. Error z value Pr(>|z|)
Intercept 4.828e-01 1.105e-01 4.370 1.24e-05 ***
distance_to_primary_road -9.749e-06 6.388e-06 -1.526 0.127012
distance_to_secondary_road -7.826e-06 9.264e-06 -0.845 0.398227
distance_to_city -1.244e-05 3.399e-06 -3.658 0.000254 ***
distance_to_town -1.225e-05 2.952e-06 -4.148 3.35e-05 ***
water_point_population -4.952e-04 4.447e-05 -11.136 < 2e-16 ***
local_population_1km 3.404e-04 1.778e-05 19.142 < 2e-16 ***
is_urbanTRUE -3.776e-01 7.990e-02 -4.726 2.29e-06 ***
usage_capacity1000 -6.547e-01 6.923e-02 -9.458 < 2e-16 ***
water_source_cleanProtected Shallow Well 4.823e-01 8.538e-02 5.649 1.61e-08 ***
water_source_cleanProtected Spring 1.219e+00 4.380e-01 2.783 0.005393 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 6534.5 on 4755 degrees of freedom
Residual deviance: 5710.2 on 4745 degrees of freedom
AIC: 5732.2
Number of Fisher Scoring iterations: 5
AICc: 5732.272
Pseudo R-square value: 0.1261403
***********************************************************************
* Results of Geographically Weighted Regression *
***********************************************************************
*********************Model calibration information*********************
Kernel function: gaussian
Fixed bandwidth: 2599.672
Regression points: the same locations as observations are used.
Distance metric: A distance matrix is specified for this model calibration.
************Summary of Generalized GWR coefficient estimates:**********
Min. 1st Qu. Median 3rd Qu. Max.
Intercept -9.2644e+02 -3.7596e+00 1.9876e+00 1.2246e+01 1656.1754
distance_to_primary_road -1.4916e-02 -4.5206e-04 -4.6235e-05 4.7318e-04 0.0180
distance_to_secondary_road -1.0417e-02 -2.7516e-04 8.9005e-05 5.1996e-04 0.0384
distance_to_city -3.1672e-02 -6.1472e-04 -1.2605e-04 2.0104e-04 0.0159
distance_to_town -3.1309e-02 -5.2199e-04 -1.3036e-04 1.9161e-04 0.0258
water_point_population -4.2113e-02 -2.0993e-03 -9.8422e-04 3.0127e-04 0.1153
local_population_1km -1.0362e-01 4.4413e-04 9.9757e-04 1.7451e-03 0.0341
is_urbanTRUE -4.0082e+02 -4.1620e+00 -1.3661e+00 1.3146e+00 753.4661
usage_capacity1000 -2.6050e+01 -9.9378e-01 -4.4356e-01 3.0436e-01 7.2031
water_source_cleanProtected.Shallow.Well -1.9221e+02 -3.7680e-01 4.6390e-01 1.7250e+00 21.4072
water_source_cleanProtected.Spring -3.8607e+02 -5.4021e+00 2.8650e+00 8.1053e+00 345.9817
************************Diagnostic information*************************
Number of data points: 4756
GW Deviance: 2960.796
AIC : 4461.633
AICc : 4743.249
Pseudo R-square value: 0.5468963
***********************************************************************
Program stops at: 2022-12-19 07:11:13
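The two diagnostic blocks above can be compared directly. As a rough check, the base R sketch below recomputes a deviance-based pseudo R-square, 1 - deviance / null deviance, from the printed figures. That this is the exact formula GWmodel applies is an assumption, although the results agree with the printed values to about four decimal places. The sketch also reports the drop in AICc from the global model to the gwLR model.

```r
# Figures copied from the printed output above
null_dev   <- 6534.5    # null deviance of the global model
global_dev <- 5710.2    # residual deviance of the global model
gw_dev     <- 2960.796  # GW deviance of the gwLR model

# Deviance-based pseudo R-square (assumed formula: 1 - deviance / null deviance)
pseudo_r2 <- function(dev, null_dev) 1 - dev / null_dev
r2_global <- pseudo_r2(global_dev, null_dev)
r2_gw     <- pseudo_r2(gw_dev, null_dev)

# Drop in AICc when moving from the global model to the gwLR model
aicc_drop <- 5732.272 - 4743.249

print(round(c(global = r2_global, gwLR = r2_gw), 4))
print(aicc_drop)
```

The lower AICc and the higher pseudo R-square both favour the gwLR model on this data.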
To assess the performance of the gwLR model, we first convert the SDF object into a data frame by using the code chunk below.
gwr.fixed <- as.data.frame(gwlr.fixed$SDF)
Next, we label the predicted values (yhat) as follows:
if yhat >= 0.5 then 1 (TRUE) and
if yhat < 0.5 then 0 (FALSE)
gwr.fixed <- gwr.fixed %>%
mutate(most = ifelse(
gwr.fixed$yhat >= 0.5, TRUE, FALSE)
)
freq(gwr.fixed$y)
Warning in freq(gwr.fixed$y): All input values are NA.
NULL
freq(gwr.fixed$most)
Warning in freq(gwr.fixed$most): All input values are NA.
NULL
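Because the freq() calls above return NULL with a warning, base R's table() is one possible fallback for inspecting the class counts. The sketch below uses a small made-up vector of probabilities (hypothetical values, not the model output) to illustrate the 0.5 cut-off and the resulting counts.

```r
# Hypothetical predicted probabilities, for illustration only
yhat <- c(0.12, 0.50, 0.73, 0.49, 0.91)

# A comparison already returns TRUE/FALSE, so ifelse() is not strictly needed
most <- yhat >= 0.5

# Count how many points fall in each predicted class
counts <- table(most)
print(counts)
```

On the real data, the same call would be table(gwr.fixed$most), assuming the column was created as in the chunk above.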
gwr.fixed$y <- as.factor(gwr.fixed$y)
gwr.fixed$most <- as.factor(gwr.fixed$most)
CM <- confusionMatrix(data = gwr.fixed$most,
reference= gwr.fixed$y,
positive = "TRUE" )
CM
Confusion Matrix and Statistics
Reference
Prediction FALSE TRUE
FALSE 1794 289
TRUE 320 2353
Accuracy : 0.872
95% CI : (0.8621, 0.8813)
No Information Rate : 0.5555
P-Value [Acc > NIR] : <2e-16
Kappa : 0.7403
Mcnemar's Test P-Value : 0.2241
Sensitivity : 0.8906
Specificity : 0.8486
Pos Pred Value : 0.8803
Neg Pred Value : 0.8613
Prevalence : 0.5555
Detection Rate : 0.4947
Detection Prevalence : 0.5620
Balanced Accuracy : 0.8696
'Positive' Class : TRUE
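Because we set the argument positive = "TRUE", sensitivity is the proportion of water points with status TRUE that are predicted correctly, and specificity is the corresponding proportion for status FALSE. For the gwLR model:
Accuracy = 87.2%,
Sensitivity = 89.06% and
Specificity = 84.86%.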
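These figures can be reproduced by hand from the four cells of the confusion matrix above. The base R sketch below copies the counts from the printed table and applies the standard definitions; the variable names are ours.

```r
# Cell counts copied from the confusion matrix above (positive class = TRUE)
tp <- 2353  # predicted TRUE,  reference TRUE
tn <- 1794  # predicted FALSE, reference FALSE
fp <- 320   # predicted TRUE,  reference FALSE
fn <- 289   # predicted FALSE, reference TRUE

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
sensitivity <- tp / (tp + fn)  # share of reference TRUE predicted as TRUE
specificity <- tn / (tn + fp)  # share of reference FALSE predicted as FALSE

print(round(c(accuracy = accuracy,
              sensitivity = sensitivity,
              specificity = specificity), 4))
```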
Osun_wp_sf_selected <- Osun_wp_sf_clean %>%
select(c(ADM2_EN, ADM2_PCODE, ADM1_EN, ADM1_PCODE, status))
Now let us append the gwr.fixed data frame to Osun_wp_sf_selected using the cbind() function, producing a simple feature output object called gwr_sf.fixed.
gwr_sf.fixed <- cbind(Osun_wp_sf_selected, gwr.fixed)
tmap_mode("view")
tmap mode set to interactive viewing
actual <- tm_shape(Osun) +
tmap_options(check.and.fix = TRUE) +
tm_polygons(alpha = 0.4) +
tm_shape(Osun_sf) +
tm_dots(col = "status",
alpha = 0.6,
palette = "YlOrRd") +
tm_view(set.zoom.limits = c(8, 12))
prob_T <- tm_shape(Osun) +
tm_polygons(alpha = 0.4) +
tm_shape(gwr_sf.fixed) +
tm_dots(col = "yhat",
border.col = "gray60",
border.lwd = 1) +
tm_view(set.zoom.limits = c(8, 12))
tmap_arrange(actual, prob_T,
asp = 1, ncol = 2, sync = TRUE)
We see that the predictions are largely aligned with the actual status of the water points.
9. Visualizing Coefficient Estimates.
The code chunk below is used to create an interactive point symbol map.
Recall that yhat is the predicted value of the dependent variable Y.
tmap_mode("view")
tmap mode set to interactive viewing
prob_T <- tm_shape(Osun)+
tm_polygons(alpha = 0.1)+
tm_shape(gwr_sf.fixed)+
tm_dots(col = "yhat",
border.col = 'gray60',
border.lwd = 1)+
tm_view(set.zoom.limits = c(8.5,14))
prob_T
tmap_mode("plot")
tmap mode set to plotting
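For a binomial model with a logit link, the yhat values mapped above are probabilities between 0 and 1, obtained by passing the linear predictor through the inverse logit. A minimal base R sketch follows. It uses the intercept of the global model printed earlier (0.4828) as an example linear predictor, which amounts to assuming that all numeric predictors are zero and the factors are at their reference levels; that setting is purely illustrative.

```r
# Inverse logit: maps a linear predictor (log-odds) to a probability
inv_logit <- function(eta) 1 / (1 + exp(-eta))

eta_example <- 0.4828  # global-model intercept, used here as a toy log-odds value
p_example   <- inv_logit(eta_example)

# A log-odds of 0 corresponds to a probability of exactly 0.5,
# so the 0.5 cut-off on yhat is a cut-off of 0 on the linear predictor
p_at_zero <- inv_logit(0)

print(c(p_example = p_example, p_at_zero = p_at_zero))
```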
10. Employing Only Statistically Significant Variables in Global and gwLR Models.
10.1 Dropping Statistically Non-Significant Variables.
As we saw earlier, 2 of the 10 variables, distance_to_primary_road and distance_to_secondary_road, are not statistically significant (p-values > 0.05), so we rebuild the logistic regression models without them.
We therefore repeat the relevant steps above for model building, assessment and visualisation in the following code chunks, starting with the model that uses only the 8 statistically significant variables.
model_refined <- glm(status ~ distance_to_tertiary_road +
distance_to_city +
distance_to_town +
is_urban +
usage_capacity +
water_source_clean +
water_point_population +
local_population_1km,
data = Osun_wp_sp,
family = binomial(link = "logit"))
blr_regress(model_refined)
Model Overview
------------------------------------------------------------------------
Data Set Resp Var Obs. Df. Model Df. Residual Convergence
------------------------------------------------------------------------
data status 4756 4755 4746 TRUE
------------------------------------------------------------------------
Response Summary
--------------------------------------------------------
Outcome Frequency Outcome Frequency
--------------------------------------------------------
0 2114 1 2642
--------------------------------------------------------
Maximum Likelihood Estimates
-----------------------------------------------------------------------------------------------
Parameter DF Estimate Std. Error z value Pr(>|z|)
-----------------------------------------------------------------------------------------------
(Intercept) 1 0.3540 0.1055 3.3541 8e-04
distance_to_tertiary_road 1 1e-04 0.0000 4.9096 0.0000
distance_to_city 1 0.0000 0.0000 -5.2022 0.0000
distance_to_town 1 0.0000 0.0000 -5.4660 0.0000
is_urbanTRUE 1 -0.2667 0.0747 -3.5690 4e-04
usage_capacity1000 1 -0.6206 0.0697 -8.9081 0.0000
water_source_cleanProtected Shallow Well 1 0.4947 0.0850 5.8228 0.0000
water_source_cleanProtected Spring 1 1.2790 0.4384 2.9174 0.0035
water_point_population 1 -5e-04 0.0000 -11.3902 0.0000
local_population_1km 1 3e-04 0.0000 19.4069 0.0000
-----------------------------------------------------------------------------------------------
Association of Predicted Probabilities and Observed Responses
---------------------------------------------------------------
% Concordant 0.7349 Somers' D 0.4697
% Discordant 0.2651 Gamma 0.4697
% Tied 0.0000 Tau-a 0.2320
Pairs 5585188 c 0.7349
---------------------------------------------------------------
We check and see that the remaining variables are all statistically significant in the logistic regression model (p-values < 0.05).
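The p-values in the table are consistent with two-sided Wald tests, so each can be recovered from its z value. A short base R check using three z values copied from the output above:

```r
# z values copied from the Maximum Likelihood Estimates table above
z <- c("(Intercept)" = 3.3541,
       "is_urbanTRUE" = -3.5690,
       "water_source_cleanProtected Spring" = 2.9174)

# Two-sided p-value under the standard normal distribution
p <- 2 * pnorm(-abs(z))

print(signif(p, 2))

# TRUE if every listed term clears the 0.05 threshold
print(all(p < 0.05))
```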
The code chunk below calculates and displays the confusion matrix for the refined model. We will discuss the results together with that for the refined gwLR model in the subsequent subsection.
blr_confusion_matrix(model_refined, cutoff = 0.5)
Confusion Matrix and Statistics
Reference
Prediction FALSE TRUE
0 1300 743
1 814 1899
Accuracy : 0.6726
No Information Rate : 0.4445
Kappa : 0.3348
McNemars's Test P-Value : 0.0761
Sensitivity : 0.7188
Specificity : 0.6149
Pos Pred Value : 0.7000
Neg Pred Value : 0.6363
Prevalence : 0.5555
Detection Rate : 0.3993
Detection Prevalence : 0.5704
Balanced Accuracy : 0.6669
Precision : 0.7000
Recall : 0.7188
'Positive' Class : 1
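Cohen's kappa corrects accuracy for the agreement expected by chance. The base R sketch below recomputes it from the cell counts of the two confusion matrices printed in this document: the refined global model above and the earlier gwLR model. The helper name kappa_from_counts is ours.

```r
# Cohen's kappa from a 2x2 confusion matrix (rows = prediction, columns = reference)
kappa_from_counts <- function(cm) {
  n  <- sum(cm)
  po <- sum(diag(cm)) / n                     # observed agreement (accuracy)
  pe <- sum(rowSums(cm) * colSums(cm)) / n^2  # agreement expected by chance
  (po - pe) / (1 - pe)
}

# Counts copied from the printed matrices, filled row by row
cm_global <- matrix(c(1300, 743, 814, 1899), nrow = 2, byrow = TRUE)
cm_gwlr   <- matrix(c(1794, 289, 320, 2353), nrow = 2, byrow = TRUE)

kappa_global <- kappa_from_counts(cm_global)
kappa_gwlr   <- kappa_from_counts(cm_gwlr)

print(round(c(global_refined = kappa_global, gwLR = kappa_gwlr), 4))
```

The gap between the two kappa values mirrors the gap in accuracy, so the gwLR advantage is not an artefact of class imbalance.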
10.2 Determining Fixed Bandwidth for GWR Model.
bw.fixed_refined <- bw.ggwr(status ~ distance_to_tertiary_road +
distance_to_city +
distance_to_town +
is_urban +
usage_capacity +
water_source_clean +
water_point_population +
local_population_1km,
data = Osun_wp_sp,
family = "binomial",
approach = "AIC",
kernel = "gaussian",
adaptive = FALSE, # for fixed bandwidth
longlat = FALSE) # input data have been converted to projected CRS
Take a cup of tea and have a break, it will take a few minutes.
-----A kind suggestion from GWmodel development group
Iteration Log-Likelihood:(With bandwidth: 95768.67 )
=========================
0 -2890
1 -2837
2 -2830
3 -2829
4 -2829
5 -2829
Fixed bandwidth: 95768.67 AICc value: 5681.18
Iteration Log-Likelihood:(With bandwidth: 59200.13 )
=========================
0 -2878
1 -2820
2 -2812
3 -2810
4 -2810
5 -2810
Fixed bandwidth: 59200.13 AICc value: 5645.901
Iteration Log-Likelihood:(With bandwidth: 36599.53 )
=========================
0 -2854
1 -2790
2 -2777
3 -2774
4 -2774
5 -2774
6 -2774
Fixed bandwidth: 36599.53 AICc value: 5585.354
Iteration Log-Likelihood:(With bandwidth: 22631.59 )
=========================
0 -2810
1 -2732
2 -2711
3 -2707
4 -2707
5 -2707
6 -2707
Fixed bandwidth: 22631.59 AICc value: 5481.877
Iteration Log-Likelihood:(With bandwidth: 13998.93 )
=========================
0 -2732
1 -2635
2 -2604
3 -2597
4 -2596
5 -2596
6 -2596
Fixed bandwidth: 13998.93 AICc value: 5333.718
Iteration Log-Likelihood:(With bandwidth: 8663.649 )
=========================
0 -2624
1 -2502
2 -2459
3 -2447
4 -2446
5 -2446
6 -2446
7 -2446
Fixed bandwidth: 8663.649 AICc value: 5178.493
Iteration Log-Likelihood:(With bandwidth: 5366.266 )
=========================
0 -2478
1 -2319
2 -2250
3 -2225
4 -2219
5 -2219
6 -2220
7 -2220
8 -2220
9 -2220
Fixed bandwidth: 5366.266 AICc value: 5022.016
Iteration Log-Likelihood:(With bandwidth: 3328.371 )
=========================
0 -2222
1 -2002
2 -1894
3 -1838
4 -1818
5 -1814
6 -1814
Fixed bandwidth: 3328.371 AICc value: 4827.587
Iteration Log-Likelihood:(With bandwidth: 2068.882 )
=========================
0 -1837
1 -1528
2 -1357
3 -1261
4 -1222
5 -1222
Fixed bandwidth: 2068.882 AICc value: 4772.046
Iteration Log-Likelihood:(With bandwidth: 1290.476 )
=========================
0 -1403
1 -1016
2 -807.3
3 -680.2
4 -680.2
Fixed bandwidth: 1290.476 AICc value: 5809.719
Iteration Log-Likelihood:(With bandwidth: 2549.964 )
=========================
0 -2019
1 -1753
2 -1614
3 -1538
4 -1506
5 -1506
Fixed bandwidth: 2549.964 AICc value: 4764.056
Iteration Log-Likelihood:(With bandwidth: 2847.289 )
=========================
0 -2108
1 -1862
2 -1736
3 -1670
4 -1644
5 -1644
Fixed bandwidth: 2847.289 AICc value: 4791.834
Iteration Log-Likelihood:(With bandwidth: 2366.207 )
=========================
0 -1955
1 -1675
2 -1525
3 -1441
4 -1407
5 -1407
Fixed bandwidth: 2366.207 AICc value: 4755.524
Iteration Log-Likelihood:(With bandwidth: 2252.639 )
=========================
0 -1913
1 -1623
2 -1465
3 -1376
4 -1341
5 -1341
Fixed bandwidth: 2252.639 AICc value: 4759.188
Iteration Log-Likelihood:(With bandwidth: 2436.396 )
=========================
0 -1980
1 -1706
2 -1560
3 -1479
4 -1446
5 -1446
Fixed bandwidth: 2436.396 AICc value: 4756.675
Iteration Log-Likelihood:(With bandwidth: 2322.828 )
=========================
0 -1940
1 -1656
2 -1503
3 -1417
4 -1382
5 -1382
Fixed bandwidth: 2322.828 AICc value: 4756.471
Iteration Log-Likelihood:(With bandwidth: 2393.017 )
=========================
0 -1965
1 -1687
2 -1539
3 -1456
4 -1422
5 -1422
Fixed bandwidth: 2393.017 AICc value: 4755.57
Iteration Log-Likelihood:(With bandwidth: 2349.638 )
=========================
0 -1949
1 -1668
2 -1517
3 -1432
4 -1398
5 -1398
Fixed bandwidth: 2349.638 AICc value: 4755.753
Iteration Log-Likelihood:(With bandwidth: 2376.448 )
=========================
0 -1959
1 -1680
2 -1530
3 -1447
4 -1413
5 -1413
Fixed bandwidth: 2376.448 AICc value: 4755.48
Iteration Log-Likelihood:(With bandwidth: 2382.777 )
=========================
0 -1961
1 -1683
2 -1534
3 -1450
4 -1416
5 -1416
Fixed bandwidth: 2382.777 AICc value: 4755.491
Iteration Log-Likelihood:(With bandwidth: 2372.536 )
=========================
0 -1958
1 -1678
2 -1528
3 -1445
4 -1411
5 -1411
Fixed bandwidth: 2372.536 AICc value: 4755.488
Iteration Log-Likelihood:(With bandwidth: 2378.865 )
=========================
0 -1960
1 -1681
2 -1532
3 -1448
4 -1414
5 -1414
Fixed bandwidth: 2378.865 AICc value: 4755.481
Iteration Log-Likelihood:(With bandwidth: 2374.954 )
=========================
0 -1959
1 -1679
2 -1530
3 -1446
4 -1412
5 -1412
Fixed bandwidth: 2374.954 AICc value: 4755.482
Iteration Log-Likelihood:(With bandwidth: 2377.371 )
=========================
0 -1959
1 -1680
2 -1531
3 -1447
4 -1413
5 -1413
Fixed bandwidth: 2377.371 AICc value: 4755.48
Iteration Log-Likelihood:(With bandwidth: 2377.942 )
=========================
0 -1960
1 -1680
2 -1531
3 -1448
4 -1414
5 -1414
Fixed bandwidth: 2377.942 AICc value: 4755.48
Iteration Log-Likelihood:(With bandwidth: 2377.018 )
=========================
0 -1959
1 -1680
2 -1531
3 -1447
4 -1413
5 -1413
Fixed bandwidth: 2377.018 AICc value: 4755.48
bw.fixed_refined
[1] 2377.371
The output for bw.fixed_refined is given above. We will use this optimal fixed bandwidth of 2377.371 to calibrate the refined gwLR model in the next subsection.
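The bandwidths tried in the log above shrink by a factor of roughly 0.618 at each early step (for example 59200.13 / 95768.67), which is consistent with a golden-section search. The sketch below illustrates that general technique in base R on a made-up objective. The names toy_aicc and golden_section_min are hypothetical, and this is not GWmodel's own implementation.

```r
# Golden-section search for the minimum of a unimodal function on [lower, upper]
golden_section_min <- function(f, lower, upper, tol = 1e-6) {
  ratio <- (sqrt(5) - 1) / 2  # about 0.618
  x1 <- upper - ratio * (upper - lower)
  x2 <- lower + ratio * (upper - lower)
  f1 <- f(x1)
  f2 <- f(x2)
  while ((upper - lower) > tol) {
    if (f1 < f2) {
      # Minimum lies in [lower, x2]: shrink from the right, reuse x1 as the new x2
      upper <- x2
      x2 <- x1; f2 <- f1
      x1 <- upper - ratio * (upper - lower)
      f1 <- f(x1)
    } else {
      # Minimum lies in [x1, upper]: shrink from the left, reuse x2 as the new x1
      lower <- x1
      x1 <- x2; f1 <- f2
      x2 <- lower + ratio * (upper - lower)
      f2 <- f(x2)
    }
  }
  (lower + upper) / 2
}

# Made-up stand-in for the AICc curve, with its minimum placed at 2377
toy_aicc <- function(bw) (bw - 2377)^2 / 1e4 + 4755

best <- golden_section_min(toy_aicc, lower = 100, upper = 95768.67, tol = 0.01)

# Ratio of the first two bandwidths printed in the log
step_ratio <- 59200.13 / 95768.67

print(best)
print(step_ratio)
```

Each iteration reuses one of the two interior evaluations, which matters here because every evaluation of the real objective means refitting the gwLR model.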
10.3 Model Assessment.
gwlr.fixed_refined <- ggwr.basic(status ~ distance_to_tertiary_road +
distance_to_city +
distance_to_town +
is_urban +
usage_capacity +
water_source_clean +
water_point_population +
local_population_1km,
data = Osun_wp_sp,
bw = 2377.371,
family = "binomial",
kernel = "gaussian",
adaptive = FALSE,
longlat = FALSE)
Iteration Log-Likelihood
=========================
0 -1959
1 -1680
2 -1531
3 -1447
4 -1413
5 -1413
Note that we use the cleaned version of the water point sf data frame so that its geometries are consistent with those used in model building (the 4 water points with missing values are excluded).
10.4 Building Fixed Bandwidth GWR Model.
bw.fixed <- bw.ggwr(status ~ distance_to_primary_road +
distance_to_secondary_road +
distance_to_tertiary_road +
distance_to_city +
distance_to_town +
is_urban +
usage_capacity +
water_source_clean +
water_point_population +
local_population_1km,
data = Osun_wp_sp,
family = "binomial",
approach = "AIC",
kernel = "gaussian",
adaptive = FALSE, # for fixed bandwidth
longlat = FALSE) # input data have been converted to projected CRS
Take a cup of tea and have a break, it will take a few minutes.
-----A kind suggestion from GWmodel development group
Iteration Log-Likelihood:(With bandwidth: 95768.67 )
=========================
0 -2889
1 -2836
2 -2830
3 -2829
4 -2829
5 -2829
Fixed bandwidth: 95768.67 AICc value: 5684.357
Iteration Log-Likelihood:(With bandwidth: 59200.13 )
=========================
0 -2875
1 -2818
2 -2810
3 -2808
4 -2808
5 -2808
Fixed bandwidth: 59200.13 AICc value: 5646.785
Iteration Log-Likelihood:(With bandwidth: 36599.53 )
=========================
0 -2847
1 -2781
2 -2768
3 -2765
4 -2765
5 -2765
6 -2765
Fixed bandwidth: 36599.53 AICc value: 5575.148
Iteration Log-Likelihood:(With bandwidth: 22631.59 )
=========================
0 -2798
1 -2719
2 -2698
3 -2693
4 -2693
5 -2693
6 -2693
Fixed bandwidth: 22631.59 AICc value: 5466.883
Iteration Log-Likelihood:(With bandwidth: 13998.93 )
=========================
0 -2720
1 -2622
2 -2590
3 -2581
4 -2580
5 -2580
6 -2580
7 -2580
Fixed bandwidth: 13998.93 AICc value: 5324.578
Iteration Log-Likelihood:(With bandwidth: 8663.649 )
=========================
0 -2601
1 -2476
2 -2431
3 -2419
4 -2417
5 -2417
6 -2417
7 -2417
Fixed bandwidth: 8663.649 AICc value: 5163.61
Iteration Log-Likelihood:(With bandwidth: 5366.266 )
=========================
0 -2436
1 -2268
2 -2194
3 -2167
4 -2161
5 -2161
6 -2161
7 -2161
8 -2161
9 -2161
Fixed bandwidth: 5366.266 AICc value: 4990.587
Iteration Log-Likelihood:(With bandwidth: 3328.371 )
=========================
0 -2157
1 -1922
2 -1802
3 -1739
4 -1713
5 -1713
Fixed bandwidth: 3328.371 AICc value: 4798.288
Iteration Log-Likelihood:(With bandwidth: 2068.882 )
=========================
0 -1751
1 -1421
2 -1238
3 -1133
4 -1084
5 -1084
Fixed bandwidth: 2068.882 AICc value: 4837.017
Iteration Log-Likelihood:(With bandwidth: 4106.777 )
=========================
0 -2297
1 -2095
2 -1997
3 -1951
4 -1938
5 -1936
6 -1936
7 -1936
8 -1936
Fixed bandwidth: 4106.777 AICc value: 4873.161
Iteration Log-Likelihood:(With bandwidth: 2847.289 )
=========================
0 -2036
1 -1771
2 -1633
3 -1558
4 -1525
5 -1525
Fixed bandwidth: 2847.289 AICc value: 4768.192
Iteration Log-Likelihood:(With bandwidth: 2549.964 )
=========================
0 -1941
1 -1655
2 -1503
3 -1417
4 -1378
5 -1378
Fixed bandwidth: 2549.964 AICc value: 4762.212
Iteration Log-Likelihood:(With bandwidth: 2366.207 )
=========================
0 -1874
1 -1573
2 -1410
3 -1316
4 -1274
5 -1274
Fixed bandwidth: 2366.207 AICc value: 4773.081
Iteration Log-Likelihood:(With bandwidth: 2663.532 )
=========================
0 -1979
1 -1702
2 -1555
3 -1474
4 -1438
5 -1438
Fixed bandwidth: 2663.532 AICc value: 4762.568
Iteration Log-Likelihood:(With bandwidth: 2479.775 )
=========================
0 -1917
1 -1625
2 -1468
3 -1380
4 -1339
5 -1339
Fixed bandwidth: 2479.775 AICc value: 4764.294
Iteration Log-Likelihood:(With bandwidth: 2593.343 )
=========================
0 -1956
1 -1674
2 -1523
3 -1439
4 -1401
5 -1401
Fixed bandwidth: 2593.343 AICc value: 4761.813
Iteration Log-Likelihood:(With bandwidth: 2620.153 )
=========================
0 -1965
1 -1685
2 -1536
3 -1453
4 -1415
5 -1415
Fixed bandwidth: 2620.153 AICc value: 4761.89
Iteration Log-Likelihood:(With bandwidth: 2576.774 )
=========================
0 -1950
1 -1667
2 -1515
3 -1431
4 -1393
5 -1393
Fixed bandwidth: 2576.774 AICc value: 4761.889
Iteration Log-Likelihood:(With bandwidth: 2603.584 )
=========================
0 -1960
1 -1678
2 -1528
3 -1445
4 -1407
5 -1407
Fixed bandwidth: 2603.584 AICc value: 4761.813
Iteration Log-Likelihood:(With bandwidth: 2609.913 )
=========================
0 -1962
1 -1680
2 -1531
3 -1448
4 -1410
5 -1410
Fixed bandwidth: 2609.913 AICc value: 4761.831
Iteration Log-Likelihood:(With bandwidth: 2599.672 )
=========================
0 -1958
1 -1676
2 -1526
3 -1443
4 -1405
5 -1405
Fixed bandwidth: 2599.672 AICc value: 4761.809
Iteration Log-Likelihood:(With bandwidth: 2597.255 )
=========================
0 -1957
1 -1675
2 -1525
3 -1441
4 -1403
5 -1403
Fixed bandwidth: 2597.255 AICc value: 4761.809
10.5 Conclusion
We see that removing the statistically non-significant variables from the gwLR model improves accuracy and specificity very slightly, while sensitivity drops slightly.